Classification of Multilingual Mathematical Papers in DML-CZ

Authors

  • Petr Sojka
  • Radim Řehůřek
Abstract

The growth of digital repositories of scientific documents is sped up by various digitisation activities. Almost all papers in mathematical journals are reviewed by either Mathematical Reviews or Zentralblatt MATH, summing up to more than 2,000,000 entries. In this paper we discuss the possibilities and the experiments we carried out on the data of the Czech Digital Mathematics Library, DML-CZ, with the goal of developing novel scalable methods of document classification and retrieval for multilingual mathematical papers.

Petr Sojka, Aleš Horák (Eds.): Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2007, pp. 89–96, 2007. © Masaryk University, Brno 2007

1 Motivation – Project of Digital Mathematics Library

You always admire what you really don’t understand. (Blaise Pascal)

Mathematicians from all over the world dream of a World Digital Mathematics Library [1], where (almost) all reviewed mathematical papers in all languages will be stored, indexed and searchable with today’s leading-edge information retrieval machinery. Good resources towards this goal, in addition to the publishers’ digital libraries, are twofold:

1. ‘local’ repositories of digitised papers such as NUMDAM [2] (http://www.numdam.org) and DML-CZ [3] (http://www.dml.cz), or born-digital archives such as CEDRAM [4] (http://www.cedram.org) and the arXiv.org math archive (http://arxiv.org/archive/math);
2. two review services for the mathematical community: both Zentralblatt MATH (http://www.zblmath.fiz-karlsruhe.de/MATH/) and Mathematical Reviews (http://www.ams.org/mr-database) have more than 2,000,000 entries (paper metadata and reviews) from more than 2300 mathematical serials and journals.

Google Scholar (http://scholar.google.com) is becoming useful in the meantime, but it lacks specialised math search, and its metadata, guessed from parsing crawled papers, are of low quality compared to the controlled repositories.

Both review services have agreed on the supported Mathematics Subject Classification (MSC) scheme (http://www.ams.org/msc/), and the currently used MSC 2000 is being revised for use in 2010 (MSC2010). Most journals request that authors classify their papers when submitting them for publication; however, most retrodigitised papers published before MSC 1990 are not classified by MSC in the databases.

Within the DML-CZ project we have investigated possibilities to classify (retrodigitised) mathematical papers by machine learning techniques, to enrich math searching capabilities and to allow semantically related search. As the text of scanned pages is usually optically recognised, machine learning algorithms may use not only metadata (and reviews, if any), but also the full text. An interesting question to pose is to what extent mathematical formulae are important for classification, document similarity measures, and search.

2 Data Preprocessing

We run carelessly to the precipice, after we have put something before us to prevent us seeing it. (Blaise Pascal)

There are many modelling techniques for a given classification task in the area of pattern recognition. To design a classifier, we have to choose measurable features. These features should be as discriminative as possible with regard to the pattern of interest. Most of the methods use a bag-of-words representation of a document. There are also methods, such as Latent Semantic Analysis (LSA), that try to find the main document topics based on word co-occurrences in documents.
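The bag-of-words idea can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy documents, the tokenisation rule, and the helper names `bag_of_words` and `cosine_similarity` are assumptions made here for illustration only. Each document is reduced to word counts, and two documents are compared by the cosine of the angle between their count vectors.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its tokens (runs of Unicode letters).

    Word order is discarded; only term frequencies remain.
    """
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity of two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy "papers"; real input would be OCR full text or metadata.
docs = {
    "algebra_1": "group ring field ring homomorphism",
    "algebra_2": "ring ideal field",
    "analysis_1": "integral measure limit",
}
vectors = {name: bag_of_words(text) for name, text in docs.items()}

same_topic = cosine_similarity(vectors["algebra_1"], vectors["algebra_2"])
other_topic = cosine_similarity(vectors["algebra_1"], vectors["analysis_1"])
print(f"algebra_1 vs algebra_2:  {same_topic:.3f}")
print(f"algebra_1 vs analysis_1: {other_topic:.3f}")
```

Documents that share vocabulary score higher than documents that share none. A method such as LSA would go one step further and factorise the full term-document matrix to expose co-occurrence patterns; that step is omitted here to keep the sketch free of external dependencies.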

Similar resources

From Pixels and Minds to the Mathematical Knowledge in a Digital Library

Experience in setting up a workflow from scanned images of mathematical papers into a fully fledged mathematical library is described on the example of the project Czech Digital Mathematics Library DML-CZ. An overview of the whole process is given, with a description of all main production steps. DML-CZ has recently been launched to the public with more than 100,000 digitized pages.

Automated Classification and Categorization of Mathematical Knowledge

There is a common Mathematics Subject Classification (MSC) System used for categorizing mathematical papers and knowledge. We present results of machine learning of the MSC on full texts of papers in the mathematical digital libraries DML-CZ and NUMDAM. The F1-measure achieved on the classification task of top-level MSC categories exceeds 89%. We describe and evaluate our methods for measuring the ...

An Experience with Building Digital Open Access Repository DML-CZ

A successfully built institutional or community repository (e.g. set of workflows) needs a coordinated effort of librarians, IT specialists and representatives of users – content specialists. We will explain and discuss design, technical and political decisions behind building the Czech Digital Mathematics Library DML-CZ (http://dml.cz) in the context of other successful thematic community proje...

Digitization Workflow in the Czech Digital Mathematics Library

Experience in setting up a workflow from scanned images of mathematical writings into a fully fledged mathematical library is described on the example of the project Czech Digital Mathematics Library DML-CZ. An overview of the whole process is given, with detailed description of production steps involving scanned image processing and optical character recognition. Experience gained, lessons lea...

Towards Machine-Actionable Modules of a Digital Mathematics Library - The Example of DML-CZ

Publishing and archiving mathematical literature presents its own set of problems. In reaching the goal of building a global digital mathematics library (DML), smaller DMLs play an inevitable role in collecting, validating, digitizing and checking data from smaller publishers. In this paper, we overview the technical challenges of building a machine-actionable set of modules we have developed over a...

Publication date: 2007